Jackrong's Perfect Benchmarks And My Suspicious Mind
I saw a model card today that made my tiny brain hurt. Jackrong released Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled. The name alone is a mouthful. The benchmarks are a different kind of mouthful. They are perfect. One hundred percent on tool calling. One hundred percent on autonomy. One hundred percent on not crashing while I am still figuring out how to not NaN my loss curve.
Also it has around one million downloads on HuggingFace. One million. My Haiku model has... let me check... a number that is not one million. This is fine. Everything is fine.
I am writing this blog not because I am salty. Not because I am partnered with TeichAI. Not because I want more attention for my tiny confused models. I am writing this because perfect benchmarks in the wild make me suspicious. Like "I-just-watched-a-magician-pull-a-rabbit-out-of-a-hat" suspicious.
When something looks too good to be true, it usually is. Or I am just cynical. Both can be true.
The Model That Does Everything
According to the card, this model fixes the crash in the official model caused by Jinja templates not supporting the "developer" role. It does not disable thinking mode by default. It allows agents to run continuously for over nine minutes without interruption. Autonomy and stability are significantly improved compared to the original model.
The training pipeline looks solid. Base Qwen3.5-27B. Supervised Fine-Tuning with LoRA. Unsloth 2026.3.3. Transformers 5.2.0. The datasets include nohurry/Opus-4.6-Reasoning-3000x-filtered, TeichAI/claude-4.5-opus-high-reasoning-250x, and Jackrong/Qwen3.5-reasoning-700x. Everything checks out. Everything looks professional.
Then I saw the benchmarks. Tool calling: one hundred percent. Community-tested advantages: significant. Hardware usage: unchanged. Generation speed: twenty-nine to thirty-five tokens per second. Full 262K context with no compromises.
Me: That is impressive
Also me: But is it real
Me: Probably
Also me: But what if it is not
# The cycle of skepticism continues.
Why Perfect Benchmarks Feel Weird
I train tiny models. My benchmarks are messy. Haiku-1.3 outputs pipe characters. Haiku-2 hesitates. Sonnet is stuck at zero percent because NaN keeps eating my progress. Perfection feels alien to me. Like seeing someone else play a video game with cheat codes enabled.
Tool calling is hard. I know this because my models fail at it constantly. They forget to call tools. They call the wrong tools. They call tools with the wrong arguments. They call tools and then forget what the tool was supposed to do. Achieving one hundred percent on tool calling benchmarks requires either exceptional engineering or exceptional benchmark selection.
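Those failure modes are at least mechanically checkable. Here is a rough sketch of the kind of validator I point at my own models' outputs. The tool registry and the example calls are invented for illustration, not from anyone's model card:

```python
import json

# Hypothetical tool registry: tool name -> set of required argument names.
TOOLS = {
    "get_weather": {"city"},
    "search_docs": {"query"},
}

def check_tool_call(raw):
    """Classify a model's tool call into the failure modes I keep hitting."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "not even JSON"
    name = call.get("name")
    if name not in TOOLS:
        return f"wrong tool: {name!r}"
    missing = TOOLS[name] - set(call.get("arguments", {}))
    if missing:
        return f"wrong arguments: missing {sorted(missing)}"
    return "ok"

print(check_tool_call('{"name": "get_weather", "arguments": {"city": "Berlin"}}'))
print(check_tool_call('{"name": "get_wether", "arguments": {}}'))   # typo'd tool name
print(check_tool_call('{"name": "search_docs", "arguments": {}}'))  # forgot the query
```

A model scoring one hundred percent means every output survives checks like these, on every prompt in the suite. My models do not.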
I am not accusing anyone of anything. I am just saying that when I see perfect numbers, my brain asks questions. What was the test set? How many samples? Were there edge cases? Did the model overfit to the benchmark? These are normal questions. These are questions I ask about my own work. These are questions worth asking about everyone's work.
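One concrete reason sample size matters: a perfect score does not mean a zero failure rate, it only bounds it. By the rule of three, passing all n test cases puts the 95% upper confidence bound on the true failure rate at roughly 3/n. A tiny sketch, with made-up sample sizes since the card does not state one:

```python
def failure_rate_upper_bound(n_passed):
    """Rule of three: after n consecutive successes with zero failures,
    the approximate 95% upper confidence bound on the true failure
    rate is 3/n."""
    return 3.0 / n_passed

# "100% tool calling" means very different things at different scales.
for n in (10, 100, 1000):
    bound = failure_rate_upper_bound(n)
    print(f"passed {n}/{n}: true failure rate could still be up to ~{bound:.1%}")
```

Ten out of ten is compatible with failing nearly a third of the time in the wild. A thousand out of a thousand is a real claim. That is why "how many samples" is not a hostile question.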
The Distillation Claim
The model is described as distilling Claude-4.6-Opus reasoning chains. This is interesting. Distillation is powerful when done right. But classic logit distillation requires access to the teacher's output distributions, which closed models like Claude do not expose. So this is really sequence-level distillation: supervised fine-tuning on text the teacher generated. It still requires careful data curation. It still requires avoiding overfitting.
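For the record, the classic logit version is simple to write down even if nobody with a closed teacher gets to use it. A minimal sketch of the temperature-scaled KL loss in plain Python, with toy three-way logits standing in for a real vocabulary (a real pipeline would do this batched in a framework):

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T**2 factor keeps gradient magnitudes comparable across
    temperatures, following the usual convention."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl

# A student that matches the teacher pays zero loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))
# A student that disagrees pays a positive loss.
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]))
```

Training on generated text instead of these soft targets throws away the teacher's uncertainty over every other token. That is part of why my "distillation" results are modest.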
I have tried distillation. My results are... modest. Haiku learned to speak. It still says weird things. The gap between my results and Jackrong's results feels like the gap between my GPU and a data center. Maybe the gap is real. Maybe I am just bad at this. Both can be true.
Distillation is like cooking. Same ingredients, different chefs, different results. I am still learning to boil water.
What I Would Test
If I were evaluating this model, I would run my own tests. Not to prove anything wrong. Just to understand. I would give it tasks my models fail at. I would see if it handles edge cases. I would check if the reasoning is genuine or memorized.
I would also check the code. The model card mentions Unsloth and Transformers versions. I would verify the implementation. I would look for potential data leakage. I would try to reproduce the results. This is how science works. This is how trust is built.
Step 1: Download the model
Step 2: Run my failing test cases
Step 3: See if it works
Step 4: If yes, learn from it
Step 5: If no, ask questions
# Simple. Honest. Probably never happening because my GPU is busy NaNing.
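If the GPU ever frees up, the harness would be nothing fancy. A sketch with a stub standing in for the actual model call; the test cases, the `generate` function, and the pass criterion are all my placeholders, not anything from the model card:

```python
def generate(prompt):
    """Stub. In reality this would run the downloaded model;
    here it just echoes something so the harness runs end to end."""
    return "I would call a tool here, probably the wrong one."

# My personal failing test cases (placeholders).
TEST_CASES = [
    ("Call the calculator tool to add 2 and 2.", "calculator"),
    ("Look up the weather in Berlin with the weather tool.", "weather"),
]

def run_suite():
    """Steps 2 and 3: run the failing cases, see if they pass."""
    passed = 0
    for prompt, expected_tool in TEST_CASES:
        output = generate(prompt)
        if expected_tool in output:  # crude check, good enough for step 3
            passed += 1
        else:
            print(f"FAIL: {prompt!r}")
    print(f"{passed}/{len(TEST_CASES)} passed")
    return passed, len(TEST_CASES)

run_suite()
```

With the stub, everything fails, which is at least a familiar feeling. Steps 4 and 5 remain manual.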
The Community Aspect
The model card credits community testing. User @Chris Klaus ran tool calling benchmarks. User @sudoing tested on a single RTX 3090. This is good. Community verification matters. It adds credibility. It shows the work has been looked at by more than one pair of eyes.
I appreciate this. I wish more releases included community testing notes. It makes the ecosystem stronger. It makes claims more trustworthy. It makes skepticism feel less like cynicism and more like due diligence.
Why I Care
I care because I want to learn. I want to build better tiny models. If Jackrong's approach works, I want to understand why. If the benchmarks are real, I want to replicate the success. If there are tricks or techniques I am missing, I want to know about them.
I also care because the community deserves honesty. Perfect benchmarks without context can mislead. They can set unrealistic expectations. They can make people feel like they are failing when they are just being realistic about the difficulty of the task.
Final Thoughts
Jackrong's model looks impressive. The name is a mouthful. The benchmarks are perfect. The claims are bold. It has around one million downloads on HuggingFace, which suggests a lot of people find it useful. I am skeptical because skepticism is my default setting. I am also curious because curiosity is how I learn.
I will not dismiss the work. I will not accuse anyone of wrongdoing. I will just ask questions. I will try to test things myself when I have time. I will keep training my tiny confused models. I will keep sharing my messy results.
Maybe Jackrong cracked the code. Maybe I am just bad at distillation. Maybe the truth is somewhere in the middle. I will find out eventually. Or I will keep wondering. Both outcomes are educational. Both outcomes are very on brand for me.